Bayesian finite mixtures: a note on prior specification 

and posterior computation 



Agostino Nobile* 
University of Glasgow, Scotland. 

May 2005 



Abstract 

A new method for the computation of the posterior distribution of the number k of components in 
a finite mixture is presented. Two aspects of prior specification are also studied: an argument is 
made for the use of a Poi{l) distribution as the prior for k; and methods are given for the selection 
of hyperparameter values in the mixture of normals model, with natural conjugate priors on the 
components parameters. 
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1 Introduction 

Finite mixture distributions have become widely used as a tool of semi-parametric inference: they 
partake of the conceptual simplicity of parametric models and of the flexibility of non-parametric 
ones. This paper is a contribution to the Bayesian analysis of finite mixtures with an unspecified 
number of components. I give arguments to support the use of a Poi(l) prior for the number of 
components and present a new method for the numerical computation of its posterior. The method 
exploits a fundamental probability identity already used by Chib (1995), but combines it with the 
representation of mixture marginal likelihoods given in Nobile (2004). I also discuss a more specific 
topic, hyperparameter selection in a finite mixture of univariate normals. Throughout the paper, 
the galaxy data set is used for illustrative purposes. 

The remainder of this section provides a brief introduction to Bayesian finite mixtures, rep- 
resentations of the associated marginal likelihoods and the mixture of normals model. Section [2] 
deals with the estimation of marginal likelihoods using the frequency of empty components in a 
Markov Chain Monte Carlo sample of the mixture allocations. Section [3] argues that the structure 
of the model suggests the Poi(l) distribution as a suitable prior for the number of components, 
when no substantive information on it is available. Section U] concerns the more practical issue of 
hyperparameter determination in the mixture of normals model, when natural conjugate priors on 
the means and variances of the components are employed. 
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1.1 Bayesian finite mixtures 

A finite mixture is a distribution with density, with respect to some underlying measure, given by 

k 

f{x) = Y,\,p,{x\e,). (1) 

The weights Aj are non-negative and sum to 1, while the component densities Pj{-\Oj) belong to some 
known parametric family. Observations garded as proceeding from the distribution 

([1]) and interest lies in the number of components k and, conditional on k, in the weights A and the 
components' parameters 9. 

The model can be rewritten by introducing latent allocation vectors g = (51, •••,(?«,) with 
Qi G {1, . . . , A:} denoting the mixture component that generated the i-th observation: 



Prbi = 3\k, A] = Aj z = 1, . . . , n, independently 

Xi\k,g,\e pj{xi\ej), j = gi, i = l,...,n. 



(2) 



In the Bayesian analysis of the model, typically one assumes that A|A; ~ Dir{ai, . . . ,ak), where 
the aj's are fixed constants. Also, the component parameters 9j are assumed a priori independent, 
conditionally on k and, possibly, a vector of hyperparameters cj): 

k 

7r{e\k,(^) = ll7Tj{ej\(^j). 

If a prior distribution TT{k) is specified, then one can obtain a sample from the joint posterior of 
{k,X,9) by means of Markov chain Monte Carlo methods, see e.g. Richardson and Green (1997), 
Phillips and Smith (1996), Stephens (2000a), Nobile and Fearnside (2005). Inference about A 
and 9 is not straightforward, because the likelihood is invariant with respect to permutations of 
the components' labels. Achieving identifiability by imposing constraints on the parameters does 
not always work and other methods have been proposed, see Richardson and Green (1997) and 
its discussion (especially the contributions of G. Celeux and M. Stephens), Celeux, Hurn and 
Robert (2000), Stephens (2000b), Friihwirth-Schnatter (2001), Nobile and Fearnside (2005). 

An alternative to sampling from the posterior of {k, A, 6) consists of estimating the marginal 
likelihoods of the mixture model with k components: 



fk:=f{x\k)= f{x\k,X,9)7r{X,9\k)dXd9, k = l,2,...,k 



max- 



Each marginal likelihood estimate makes use of MCMC output for a model with fixed k. The 
estimates can be used to compute Bayes factors for k vs. k — 1 components, or to compute the 
posterior of k, 7r(A;|x) oc ■7r{k)fk- It should be noted that estimation of the marginal likelihood from 
MCMC output is not as simple as other posterior inference using MCMC, and as a consequence 
several methods have been proposed, see e.g. Chib (1995), Raftery (1996), DiCiccio et al. (1997), 
Gelman and Meng (1998) and references therein. 

1.2 Marginal likeHhoods of finite mixtures 

The marginal likelihoods can be rewritten as 

fk= Y.f(9\k)f{x\k,g) (3) 

g&Qk 
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where the sum extends over the set Gk of all the allocation vectors with entries less than or 
equal to k, see e.g. Nobile (1994, 2004). In equation ([3]), f{g\k) = J f{g\k, X)TT{\\k) dX = 

k 

r(Q0A:) TT ""J-^ with aok = J^^^-i Oij and n,- equal to the number of observations that 
r(aofc + n)ii T[aj) ^J-i ^ ^ 

g allocates to component j. The other term in the right hand side of ([3]), f{x\k,g), is obtained 
by integrating f{x\k,g,6) from ^ with respect to the prior distribution of 9\k. Although this 
integration can be performed in closed form only for some prior distributions, notably natural 
conjugate priors on 6, representation ([3]) is always valid. Under the assumption that the Dirichlet 
hyperparameters aj and the prior distributions 7rj(-\(j)j) remain the same for fixed j as k varies, the 
marginal likelihoods enjoy further representations. Partition the set of allocation vectors Qk as 

k 

Gk = [j Gt, Gt n = 0, t^s 
t=i 

where G* is the set of allocation vectors which assign at least one observation to component t and 
none to higher components. Also, let /* be the portion of the marginal likelihood ft that accounts 
for vectors g allocating at least one observation to component t and none to higher: 

ft= Y.f{g\t)f{x\t,g). (4) 

Then one can show (see Nobile 2004, page 2049) that, for all g ^ Gt with t < k, 

f(.\k.s) = mi,s) and n-<>^) n^<>'+''> ^,,,, ,5) 

f{g\t) T{aok + n) r(aot) 

and that 

k 

fk = ^aktft (6) 
t=i 

= ak,k-lfk-l + fk- (7) 



If the prior distribution of {X,9)\k is invariant to permutations of the components labels, a 
stronger result is available. Let Gji be the subset of G* consisting of allocations with h < t non- 
empty components. In particular, any vector g ^ G^ Gh assigns at least one observation to each 
mixture component 1, . . . ,h. Let fl be the portion of fh which corresponds to allocations with no 
empty components: 

fl = Y.f(9\h)f{x\h,g). 



Then (Nobile 2004, page 2053) 



kAn /i\ 

^ = E U (8) 

h=i ^ ^ 



For related representations see Ishwaran, James and Sun (2001). 
1.3 Mixtures of univariate normals 

The method to be presented in the following section is of general applicability. Since mixtures of 
univariate normals will be used as an illustration, I introduce here some notation. It is assumed 
that the component densities Pj(-|^j) are normal with mean rrij and variance r~^: 

md. 



Xi\k,g,X,e N{mj,r-'^), j = gi, i = 1, 



,n. 
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Independent natural conjugate priors are placed on 9j = {rrij, rj), j = 1, . . . , k: 

Tj GaH, S) 

ind (9) 
rrijlrj iV(/u, {rrj}"^), 

with E{rj) = j/6. See Diebolt and Robert (1994) or Nobile and Fearnside (2005) for more details. 
Other priors on 9, such as the independent prior used by Richardson and Green (1997), could be 
used as well. 

The prior distribution ([9]) requires the specification of four hyperparameters. The overall mean 
fi is set to a round value close to the sample mean x. The shape parameter 7 is half the degrees 
of freedom of the prior predictive t distribution. I choose 7 = 2, to have a prior predictive, with 
relatively thick tails, but finite second order moments. The choice of the scale parameter 5 and of 
r, the prior ratio between within components variance and variance of the means, is discussed in 
Section HI 



2 Marginal likelihoods from empty components 

For a given parametric model f{x\9) and prior distribution f{9), the marginal likelihood of the 
observed data x is defined as f{x) = J f {x\9) f {9) d9 . Using Bayes theorem, f{x) can be rewritten 

fix) = ^('^^("'^^ (10) 

where f{9) and f{x\9) are assumed computable, including their normalizing constants, and the 
formula holds for any parameter value 9. Expression (|10p forms the basis of a method of marginal 
likelihood estimation, see Chib (1995) and Raftery (1996). In short, although typically the posterior 
f{9\x) cannot be evaluated exactly, an estimate of it at some parameter value 9 can be obtained 
using a Monte Carlo sample; substituting this estimate into ([TO]) yields an estimate of f{x). 

In the context of Section [1.2[ the marginal likelihood for the model with k components can be 
written as 

, fi9\k)fix\k,g) 

Here the allocation vector g plays the role of 9 in the above discussion and everything is conditional 
on k. Since (jlip holds for all g € Gk, still holds if one sums both numerator and denominator 
over any non-empty set E: 

_ EgeEfi9\k)fix\k,g) 

E,eEfi9\k,x) ■ ^^'^ 

Letting E = Q'^, the denominator of ()12p is the posterior probability of t/^, while the numerator 
equals f^ = fk — Ofc.fc-i/fc-i; using equations ^ and ([7]). One then obtains 



„ fk — 0-k,k-lfk-l 

Ik — 

and after rearranging 



Fr[gi\k,x] 
fk flfc.fc-l 



fk-i i-Fi[gi\k,xY ^^^^ 

The left hand side of is the Bayes factor B^^k-i for the model with k components against 
the model with k — 1 components. In the right hand side a^^k-i is a known constant, while the 
denominator is the posterior probability, according to the model with k components, that the k- 
th component is empty, which can be easily estimated using a MCMC sample from f{g\k^x). In 
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some mixture models /i = fi is computable exactly; if this is the case, estimates of the marginal 
likelihoods, if needed, can be readily produced from the Bayes factors. Otherwise, one can still 
obtain estimates of normalized marginal likelihoods, by setting fi = I and then rescaling the 
sequence of fkS. 

Using formula (|13|) is somewhat wasteful, since it only employs the fixed k MCMC sample to 
estimate Pr[^^|A;, x]: the MCMC sample for k+1 components can be used to estimate Pr[C/^|A:+l, x], 
a quantity that is related to the probability in (jl3p . Let Pr[^*|/c,a;] with t < k he the posterior 
probability, conditional on k components, that component t is non-empty and components t + 1 
through k are empty. One can show, see the Appendix, that 

^max 



^ Pr[g*^,\k,x] 
at+i,t . (14) 

Pr[gt\k,x] 



ft+l k=t+l 

n 



k=t+l 



Setting fl = fi if available, or /i = 1 if not, the sequence /j* t = 1, . . . , fcmax can be estimated 
by replacing the probabilities in (114p with MCMC estimates. An application of ([6]), followed by 
rescaling, then produces estimates of the normalized marginal likelihoods. 

Formulae (I13p and ()14p do not assume that the prior on (A, 6)\k is invariant to permutations of 
the components labels, only that the hyperparameters aj, (pj are the same for all k > j. If the prior 
is invariant, the additional symmetry can be exploited as follows. Let 0^ = IJt=/i 
of allocations g in Qf^ which assign observations to exactly h components. Then Pr[^^|A;,x] is the 
posterior probability, conditional on k components, that h < k components are non-empty. One 
can show, see the Appendix, that 



fh+l _ (u I 1 ^ k=h+l 
Jh 



£ VT[Gl^,\k,x] 
{h + 1) . (15) 

r<!max 

^ {k-h)Fr[g^^\k,x] 

k=h+l 

Replacing the probabilities in (jl5p with MCMC estimates and setting /| to 1 (or fi if available), 
yields estimates of the sequence of /^'s; plugging these estimates in formula ([8]) and rescaling 
produces estimates of normalized marginal likelihoods. 

To illustrate the method, formula ()15p was used to compute the marginal likelihood of k com- 
ponents for the galaxy data. This data set consists of velocity measurements (1000 Km/sec) of 82 
galaxies from the Corona Borealis region. Since its appearance in Roeder (1990), it has been stud- 
ied by several authors, see Aitkin (2001) for an interesting comparison of likelihood and Bayesian 
analyses of this data set. The data was modelled as a finite mixture of univariate normals, as set 
out in Section [1.31 The weights hyperparameters a were set to 1, while the other hyperparameters 
were = 20, r = 0.04, 7 = 2 and 6 = 2, their choice is discussed in Section [H In this example I 
used Gibbs sampling of the allocation vectors g, after integrating out the weights and components 
parameters, see Nobile and Fearnside (2005). However, the method applies equally well to the 
Gibbs sampling scheme involving both parameters and allocations, see for instance Diebolt and 
Robert (1994) and Richardson and Green (1997), as long as empty components are allowed. Each 
Gibbs sampler with fixed k was run for 20000 sweeps, with 1000 sweeps of burn-in. The final 
allocation in the run with k components served as the starting allocation for the run with k+1 
components. The estimates of the marginal likelihoods normalized to sum to 1 are displayed as 
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line-joined dots in Figure 1. For comparison, the figure also contains the estimate of the posterior 
of k with uniform prior on k = 1, . . . , /^max = 50 using a different method, the allocation sampler 
of Nobile and Fearnside (2005). This sampler was run for 1 million sweeps with a burn-in of 10000 
sweeps and keeping only one draw every 10. The agreement between the estimates from the two 
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Figure 1: Two estimates of the posterior distribution of k for the galaxy data, using a discrete uniform 
prior on fc = l,...,fcijiax ~ 50. Histogram gives the frequency of model with k components 
using the allocation sampler of Nobile and Fearnside (2005). Line-joined dots are the normalized 
marginal likelihoods using formula (jisp. See main text for hyperparameter values used. 



unrelated methods provides a welcome check on them, all the more so since visual inspection of the 
galaxy data suggests between three and six clusters, while the posterior of k displayed in Figure [T] 
assigns to this range of values a probability smaller than 0.02. If one is to believe the estimates 
in Figure dl as the agreement between the two methods seems to suggest, it would seem that the 
posterior of k has little to tell about the number of clusters in a data set. In the next section I 
argue that this is not the case and that replacing the uniform prior on k with a Poi{l) distribution 
yields a posterior that is more suitable for inference about the number of actual groups in the data. 



3 The prior distribution of the number of components 

In this section I assume that the prior on {X,9)\k is invariant to permutations of the components 
labels. Recall from Section 11.21 that is the part of the marginal likelihood fh corresponding to 
no empty components 

fl = T.f(9\h)f{x\h,g) 



6 



and that representation ([8]) holds: 



fcAn 



To have an understanding of how formula ^ arises, look at Figure [2] which displays the nested 
structure of G4, with each set of digits denoting allocation vectors with entries equal to those digits 
only. From formula ([3]), /4 is the sum over C/4 of f{g\k = 4)f{x\k = 4, g). Formula ([8]) gives this 



® 



13 23 



14 24 34 



124 134 234 



Figure 2: The nested structure of Qk, k = A. Each set of digits denotes a set of allocation vectors with 
entries equal to those digits only. Nested boxes denote Gi, ■ ■ ■ , Ga- Ellipses enclose the sets G^, 
h — 1, . . . ,A. In box Gk, the complement of the enclosed Gk~i box is the set Gk- 



sum in terms of fPs 



which are sums over the sets G^, h 



1, . . . , 4 denoted by ellipses in Figure [2j 
The terms a^h serve to rescale f{g\h) to f{g\k), while the combinatorial terms (^) give the number 
of "copies" of that are present in Gk- 

A consequence of formula ^ is that the marginal likelihood of k components may derive to 
a large extent from allocations with less than k non-empty components. For instance, consider a 
hypothetical data set of n = 80 observations clearly clustered in nine well separated groups, to such 
an extent that fj^/fg is nearly 0, for h ^ 9. Then formula ^ implies that 



fk_ 

h 



OfcQ) 



k>9. 



(16) 



With a = 1, one obtains the values reported in Table [H With a discrete uniform prior on k, 



k 


9 


10 


11 


12 


13 


14 


15 


fk/fg 


1. 


1.011 


0.618 


0.299 


0.127 


0.050 


0.018 



Table 1: Ratio of marginal likelihood /fc/Jg for a hypothetical data set with n = 80 observations, such that 

the posterior of k gives probability less that 1/3 to A; = 9. Put differently, upper bounds on the 
posterior of k can be derived from representation ([8]). Table O taken from Nobile (2004), displays 
upper bounds corresponding to a discrete uniform prior and a = 1. Nobile (2004) contains further 
discussion and tables for a = 2 and a = 0.5. The overall conclusion is that the bounds are weaker 
for larger sample sizes, smaller values of k and larger values of a. 
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n 


1 


2 


3 


4 


5 


k 

6 


7 


8 


9 


10 


20 


0.9000 


0.7286 


0.5299 


0.3456 


0.2880 


0.2419 


0.1954 


0.1756 


0.1505 


0.1335 


50 


0.9600 


0.8847 


0.7826 


0.6645 


0.5414 


0.4233 


0.3175 


0.3119 


0.2835 


0.2402 


100 


0.9800 


0.9412 


0.8858 


0.8170 


0.7385 


0.6541 


0.5677 


0.4828 


0.4023 


0.3322 


500 


0.9960 


0.9880 


0.9762 


0.9607 


0.9417 


0.9193 


0.8938 


0.8656 


0.8350 


0.8022 



Table 2: Bounds on 7r(fc|a;) for several sample sizes n, 7r(fc) = l/kmax, k — 1^ . . . , k^ax = 50, a — 1. 



It is worth mentioning at this point that, as the sample size grows, the marginal likelihood 
of k components will tend to reflect more and more only allocations with no empty components. 
Formally, 

i^^akh^ 5{k=h}, n ^ oo, (17) 

see the Appendix for a proof. Hence, from formula ([8]), /fc — /| ^ as n ^ oo. 

Returning to the example of nine well separated groups, it is the combinatorial term that 
makes /lo > /g in Table [H If one were to drop the (g) term from equation (I16p . the entries in 
Table [1] would be as in Table [3l The (g) term accounts for the fact that with = 10 components. 



k 


9 


10 


11 


12 


13 


14 


15 


fk/ h 


1. 


0.10112 


0.01124 


0.00136 


0.00018 


0.00002 


0.00000 



Table 3: Ratios fk/ as in Table [U but dropping the term (g) from equation ([T6|) . 

there are ten possible ways of choosing nine components to have observations and one component 
to be empty. Of course, this is a consequence of the model entertained and its ability to allow for 
empty components, which correspond to mass on small values for some weights in the prior of A. 
Nonetheless, the increasing effect on the marginal likelihoods, as k grows, of the many ways in which 
some of k components may be empty, is a rather unappealing feature of the model. Nobile (2004) 
has suggested to shift attention from the number of components to the number of non-empty 
components and to compute its posterior distribution. In this paper I follow a different approach: 
trying to counteract the combinatorial terms in the marginal likelihoods by an appropriate choice 
of the prior distribution of k. 

Multiplying equation ([8]) by the prior distribution 7r(A;) and writing the result explicitly for the 
first few k, one has 

7r(l|a;) 
'k{2\x) 

7r(3|x) 

where ^ is a normalizing constant. Although there is no prior ■n{k) which exactly cancels out 
the binomial coefficients (^), one can keep the contribution of to 7r(fe|x) small, relative to its 



^•vr(l)/i 



A 
A 



7r(2)4 + 7r(2)f ja2i/: 



7r(3)4 + 7r(3) ( ^ J a^2fl + vr(3) ( ^ j a,,fl 
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contribution to Tr(h\x), by requiring that 

7r(/i) =sup|7r(A;)Q| h=l,2,.... (18) 

It is easy to verify that a Poi{l) distribution satisfies equations (jlSp . Indeed, every prior TT{k) 
satisfying equations ([18]) is proportional to a truncated Poi{l) distribution, see the Appendix. For 
simphcity, I will take the prior on k to be Poi{l). 

One way to illustrate the effect of the Poi(l) prior on k is by recomputing the bounds on Tr{k\x) 
with this prior; they are reported in Table [H Compared to the bounds with a discrete uniform 
prior in Table [21 they are much weaker, especially for higher values of k. 



n 


1 


2 


3 


4 


5 


k 

6 


7 


8 


9 


10 


20 


0.9525 


0.9114 


0.8756 


0.8441 


0.8162 


0.7913 


0.7690 


0.7488 


0.7306 


0.7140 


50 


0.9804 


0.9619 


0.9445 


0.9280 


0.9124 


0.8976 


0.8836 


0.8703 


0.8576 


0.8455 


100 


0.9901 


0.9805 


0.9712 


0.9621 


0.9533 


0.9447 


0.9364 


0.9283 


0.9204 


0.9128 


500 


0.9980 


0.9960 


0.9940 


0.9921 


0.9901 


0.9882 


0.9863 


0.9844 


0.9825 


0.9806 



Table 4: Bounds on 7r(fc|x) for several sample sizes n, 7r(fc) oc 1/fc!, a — 1. 

For another illustration, reconsider the example of nine well separated groups in Table [H The 
ratio of posterior probabilities 7r(A;|2;)/7r(9|2;) using a Poi(l) prior are given in Table [5l 



k 


9 10 


11 


12 


13 


14 


15 


7r(/c|x)/7r(9|x) 


1. 0.10112 


0.00562 


0.00023 


0.00001 


0.00000 


0.00000 



Table 5: Ratio of posterior probabilities 7r(fc|a;)/7r(9|x) for the same hypothetical data set as in Table [l] 

using a Poi{l) prior on k. 

As a further illustration, return to the galaxy data example in Section [21 Figure [3 contains the 
posterior of k computed using the same hyperparameters and methods as in Figure [H but with a 
Poi{\) prior on A;, rather than discrete uniform. Other examples, for real and artificial data sets, of 
posterior distributions of k based on a Poi{l) prior can be found in Nobile and Fearnside (2005). 

4 Mixtures of normals: hyperparameter selection 

This section is concerned with the choice of hyperparameters in mixtures of univarite normals, with 
natural conjugate priors on the means and variances. The method to be described can be readily 
adapted to the case of multivariate components, or components from other parametric families. 
I continue to use the galaxy data set for illustrative purposes. The marked sensitivity to prior 
specification exhibited in the analysis of this data is, in my experience, far from typical. However, 
it demonstrates well what difficulties may arise. Patterns of dependence of the marginal likelihood 
on the prior of 9 are likely to be simpler in one-parameter families; see Aitkin (2001, page 289) for 
a related remark. 

Figure [H displays estimates of the posterior distribution of the number of components for the 
galaxy data, corresponding to several values of the hyperparameters r and 5. The other hyperpa- 
rameters were set to = 20 and 7 = 2, as discussed in Section [L3l The prior on k was Poi{l) and 
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k 

Figure 3: Two estimates of the posterior distribution of k for the galaxy data, using a Poisson(l) prior. 

Histogram gives the frequency of model with k components using the allocation sampler of Nobile 
and Fearnside (2005). Solid line is the posterior of k obtained from the normalized marginal 
likelihood using formula (|15p . See main text for hyperparameter values used. 

computations were done using formula (jl5[) . Although in all plots most of the mass is concentrated 
on values of k between 2 and 8, a simple glance at the figure conveys how dependent on hyperpa- 
rameter values -7r(A;|x) may be. One can also see that, for given 5, as r increases at first posterior 
mass shifts to higher values of k, and then it moves back to lower values of k. The behaviour for r 
fixed and 5 increasing consists, apart for few exceptions, of a shift of probability mass from higher 
to lower values of k. Most pairs (r, 5) yield negligible posterior mass for k = 2. However, some 
pairs in the upper right corner of the plot assign considerable mass to it. These pairs correspond 
to a prior distribution that makes likely high values of the variance within each normal component; 
in turns this makes it plausible to place in a single group the smallest and largest observations in 
the galaxy data, with a central group accounting for most of the other observations. 

Putting a hyperprior 7r(r, 6) on the two hyperparameters, and sampling from the joint posterior 
of all the unknowns, including r and 5, did not solve the problem. Some experimentation with a few 
hyperpriors showed that ■K{k\x) was to a considerable extent affected by the choice of hyperprior: 
the marginal posterior distributions of r and 5 had very long tails and changed markedly with 
7r(r, 5). For this reason, I preferred to adopt an empirical Bayes stance and estimate r and 5 rather 
than mixing with respect to their posterior distribution. 

I settled on independent priors: f7n(0,l) for (1 + t)-\ the prior proportion of variance within 
a component to the total variance, and C/n(0, 5u) for 5, where (5(7 = (7 — l)s^ and s^. is the sample 
variance. The choice of 5u yields a prior expectation of the components variance equal to The 
random walk Metropolis-Hastings algorithm was used to update r and 5 given all other variables. 
Figure [5] displays boxplots of the marginal posterior distributions of r and (5, on a logarithmic scale, 
conditional on k. Both plots display a pattern whereby a clear change of level occurs as k increases. 
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Figure 4: Posterior distribution of k for the galaxy data, using a Poi{l) prior, for several values of the 

hyperparameters r and S. 



The procedure I used to estimate r and 5 consists of taking the median of the posterior draws, after 
discarding those corresponding to values of k preceding the point where a rough level-off of the 
medians has occurred. The rationale is that if increasing fc by 1 markedly changes the posteriors of 
T and 6, it is because it affords a considerable reduction of the within-components variability, by 
replacing it with between-means variability. The median of r seems to level off at = 4. For 5 the 
picture is less clear, but the decreases are much smaller past fe = 6. The end result are the rough 
estimates f = 0.04 and 6 = 2. These values were used in the runs reported in Sections [2] and [3l A 
similar procedure was used by Nobile and Fearnside (2005), to which I refer for further examples. 

The overall lesson seems to be that estimates of 7r{k\x) provide only a rough, though useful, 
guide to the number of groups in the data and that there is really no substitute for the kind of 
sensitivity analysis performed in Figure [H 

Appendix 

A.l Proof of formula ( IT4l) 

Letting E = Q* in formula (|12p . with t < k, yields 

_ T^geg* fi9\k)fix\k,g) 
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Figure 5: Boxplots of draws from the posterior distributions, conditional on k, of r in panel (a) and of 6 

in panel (b). Whiskers are drawn at the 0.005 and 0.995 quantiles. 



Now from formulae ([5]) the numerator equals aktft-, so that solving for produces 

It — Jk- 

O-kt 

Then taking the ratio between f*^i and /* and rearranging terms gives 

/.'Vi PT[gt\k, x] = at+i,t ft ^AQt+i\k, x] t<k 

where one uses ak^t/0'k,t+i = 0-1+1,1, a simple consequence of ([5]). Summing both sides over k from 
t + 1 to kmax and rearranging yields ([H 



A. 2 Proof of formula ( iTSf ) 

Let £' = in formula ()12p to obtain 



fk 



T.g(.gk fig\k)f{x\k,g) 
Pr[g^\k,x] 



h< k. 



The numerator is equal to {f^)o,khfh (see Nobile 2004, Proof of Proposition 4.3), so that rearranging 
one has 



Okh 



Taking the ratio between fl_^_i and and rearranging terms produces 

fl^, {k - h)Fv[g^^\k,x] = fl (h + l) a;,+i,,Pr[gti|A;,x]. 



Finally, sum both sides over k = h + 1, . 
A.3 Proof of formula ([171) 



) ^max 



and solve for f]i_^i/ 1\ to obtain (fTSll . 



From formula ([5]) and under the assumption that the prior is invariant with respect to permutations 
of the labels, 

^k\ /k\ T{ka) T{ta + n) 



t J T{ka + n) T{ta) 
12 



t < k 



Using formula 6.1.46 in Abramowitz and Stegun (1964), n^^ T(ka + n) ^ ^ Hence 



Therefore, for t < k, {j-)akt ^{k=t} as n ^ cx). 

A. 4 Proof that every distribution satisfying equations (llSft is truncated Poi{l) 

The proof proceeds as follows: assume that 7r{k) = for k larger than some value k, use induction 
to derive 7r{k) with k < k, finally let k oo. Suppose that, for all A; = j + 1, . . . , A:, one has 

- fc' 

n{k)=7T{k)- (19) 



Then equation p9]) also holds for k = j: 



7r(j) = sup <^ TT{k) 

k>j I \J 



max_< 7r(/cj- 



j<fc<fc I k\j\{k-j)\ 
= nikf- 

where the first line uses (jlSp while the second follows from (jl9p and 7r(A;) = for k > k. Since 
equation (fTUj) clearly holds for A; = /c, an appeal to induction yields 

- A;' 1 - 
7r(A;) = 7r(A;)— oc — A; = 1,...,A;, 

i.e., Poi{l) restricted to 1 < A; < A;. Letting A; — > oo yields a Poi(l) distribution restricted to A; > 1. 
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